Interactive Visualization of Soybean Population Genetic Data
Big Data
Visualization is an important tool for working with big data
Adaptations must be made to handle:
- Overplotting (large \(n\))
- High-dimensional data (large \(p\))
- Distributed/multi-source data, hierarchical data
- No single solution (binning, dimension reduction, tours) works for every situation
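To make the binning idea concrete, here is a minimal sketch (an illustration only, not code from this project) that replaces individual points with counts per rectangular bin, so the number of plotted objects depends on the grid size rather than on \(n\). The function name `bin_2d`, the bin widths, and the sample points are all hypothetical.

```python
from collections import Counter

def bin_2d(points, x_width, y_width):
    """Count points per rectangular bin; keys are (x_bin, y_bin) indices."""
    counts = Counter()
    for x, y in points:
        # Floor division maps each coordinate to the index of its bin.
        counts[(int(x // x_width), int(y // y_width))] += 1
    return counts

# Hypothetical sample data: four points on a unit grid.
points = [(0.1, 0.2), (0.4, 0.3), (1.2, 0.1), (1.9, 1.5)]
counts = bin_2d(points, x_width=1.0, y_width=1.0)

for bin_index, n in sorted(counts.items()):
    print(bin_index, n)
```

Each bin can then be drawn as one tile shaded by its count, which is one way to show density when individual points would overplot.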
Interactive Graphics
- Provide additional information in response to user action
- Simultaneously show more than 2-3 variables and their relationships (multiple linked plots)
- Accommodate complex data structures
BUT…
Web-based interactive graphics may be even more size-sensitive than static graphics.
Overall Project Goals:
- Understand historical yield increases
100% increase in past 100 years; additional 70% increase needed by 2050 to meet food needs (World Bank)
- Associate genetic features with phenotypic traits
Disease resistance, yield, nutritional content, time to maturity
- Communicate analysis results intuitively:
- Target: Soybean farmers, plant geneticists
- Provide full results (tables) and graphical summaries
- Interface with existing databases and web resources
Data
- Sequencing Data
(79 varieties, 75 GB processed and compressed)
- Field Trials
(168 varieties, 30 varieties with genetic data)
- New crosses with highest yield varieties
(sequencing + field trials)
- Genealogy as reported in the breeding literature
(1600 varieties)
Visualizing SNPs
- SNP: Single Nucleotide Polymorphism
A single basepair mutation (A -> T, G -> A, C -> G)
- Shiny applet: Responsive applet for user-directed data subsets
- Show multiple levels of detail (less detail = lower computational load)
- Provide resources in the applet for user exploration (not just a reference tool)
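The trade-off between level of detail and computational load can be sketched as follows. This is an assumption-laden illustration in Python rather than the applet's actual Shiny code: SNP positions are summarized as counts per fixed-width window, and a wider window yields fewer values to compute and draw. The name `snp_density`, the window sizes, and the positions are hypothetical.

```python
from collections import Counter

def snp_density(positions, window_bp):
    """Return {window_start: SNP count} for fixed-width windows along a chromosome."""
    counts = Counter((pos // window_bp) * window_bp for pos in positions)
    return dict(sorted(counts.items()))

# Hypothetical basepair positions of SNPs on one chromosome.
positions = [120, 950, 1_400, 1_450, 5_200, 9_800]

coarse = snp_density(positions, window_bp=10_000)  # overview: one value
fine = snp_density(positions, window_bp=1_000)     # zoomed in: more values

print(coarse)
print(fine)
```

An overview plot could use the coarse summary, with the finer one computed only for the region a user selects.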
Visualizing SNPs:
- Huge number of interesting genes (70 million ID’d SNPs)
- 79 varieties, 20 chromosomes
- Phenotype and genealogy information
- Researchers tend to work on gene subsets:
Must be able to zoom and filter
- Optimized files for SNP results are still large (10 GB) and require significant computational resources
Above all, need an interface to allow people to pull new discoveries from the data systematically.
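One way the zoom-and-filter requirement could be met without scanning every SNP is to keep positions sorted and look up a region by binary search. The sketch below is a hypothetical Python illustration, not the project's implementation; `snps_in_region` and the sample positions are invented for the example.

```python
from bisect import bisect_left, bisect_right

def snps_in_region(sorted_positions, start, end):
    """Return the positions p with start <= p <= end from a sorted list."""
    lo = bisect_left(sorted_positions, start)   # first index with p >= start
    hi = bisect_right(sorted_positions, end)    # first index with p > end
    return sorted_positions[lo:hi]

# Hypothetical sorted basepair positions on one chromosome.
positions = [120, 950, 1_400, 1_450, 5_200, 9_800]

region = snps_in_region(positions, start=900, end=1_500)
print(region)
```

Each lookup costs two binary searches plus the size of the returned slice, so the response time depends mainly on how far the user has zoomed in.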
Applet Design

SNP Population Distribution

Density of SNPs: Chromosome Level

Individual SNPs: Comparing Varieties

Genealogy and Phenotypes